Measurement of the effectiveness of transitive sequence comparison, through a third 'intermediate' sequence

Author

  • Mark Gerstein
Abstract

MOTIVATION Transitive sequence matching expands the scope of sequence comparison by re-running the results of a given query against the databank as a new query. This sometimes results in the initial query sequence (Q) being related to a final match (M) indirectly, through a third, 'intermediate' sequence (Q --> I --> M). This approach has often been suggested as providing greater sensitivity in sequence comparison; however, it has not yet been possible to gauge its improvement precisely.

RESULTS Here, this improvement is comprehensively measured by seeing what fraction of the known structural relationships transitive sequence matching can uncover beyond that found by normal pairwise comparison (i.e. direct linkage). The structural relationships are taken from a well-characterized test set, the scop classification of protein structure. Specifically, 2055 known structural similarities (called 'pairs') between distantly related proteins constitute the basic test set. To measure transitive matching properly, special data sets, called 'baseline sets', are derived from this. They consist of pairs of sequences that have a clear structural relationship that cannot be found by normal sequence comparison (i.e. they cannot be directly linked). Specifically, using standard sequence comparison protocols (FASTA with an e-value cut-off of 0.001), it is found that the baseline set consists of 1742 pairs. A third intermediate sequence can link 86 of these indirectly (5%), where this third sequence is drawn from the entire, current universe of protein sequences. The number of false positives is minimal. Furthermore, when one considers only the relationships within the test set that correspond to a close structural alignment, the coverage increases considerably. In particular, 862 of the baseline set pairs fit to better than 2.6 Å RMS, and transitive matching can find 62 of these (9%).
AVAILABILITY All the test data, including precise similarity values calculated from structural alignment, are available in tabular format over the Web from http://bioinfo.mbb.yale.edu/align.

CONTACT [email protected]
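
The Q --> I --> M linkage described in the abstract can be illustrated with a minimal Python sketch. This is not the author's pipeline: the function names, the `toy_scores` table, and its e-values are hypothetical, and a dictionary lookup stands in for an actual FASTA search. Only the 0.001 e-value cut-off and the counts 86 and 1742 come from the abstract; treating a hit as significant when its e-value is at or below the cut-off is an assumption.

```python
from typing import Dict, Optional, Set, Tuple

E_VALUE_CUTOFF = 0.001  # cut-off quoted in the abstract

# Hypothetical e-values keyed by (query, target); stands in for a FASTA run.
Scores = Dict[Tuple[str, str], float]


def direct_hits(query: str, scores: Scores, cutoff: float = E_VALUE_CUTOFF) -> Set[str]:
    """Targets that `query` matches directly (e-value at or below the cut-off)."""
    return {t for (q, t), e in scores.items() if q == query and e <= cutoff}


def find_intermediate(q: str, m: str, scores: Scores,
                      cutoff: float = E_VALUE_CUTOFF) -> Optional[str]:
    """Return an intermediate I with Q -> I and I -> M both direct hits.

    Returns None if Q and M are already directly linked (such a pair would
    not belong to a baseline set) or if no intermediate exists.
    """
    first_round = direct_hits(q, scores, cutoff)
    if m in first_round:
        return None
    # Re-run every first-round hit as a new query (sorted for determinism).
    for candidate in sorted(first_round):
        if m in direct_hits(candidate, scores, cutoff):
            return candidate
    return None


def coverage_percent(linked: int, total: int) -> float:
    """Share of baseline pairs that transitive matching links, in percent."""
    return 100.0 * linked / total


# Toy data (made up): Q misses M directly but reaches it through I.
toy_scores: Scores = {
    ("Q", "I"): 1e-10,
    ("Q", "X"): 1e-5,
    ("Q", "M"): 0.5,
    ("I", "M"): 1e-6,
    ("X", "M"): 2.0,
}

if __name__ == "__main__":
    print(find_intermediate("Q", "M", toy_scores))
    print(round(coverage_percent(86, 1742)))  # counts taken from the abstract
```

In this toy table, Q has two first-round hits (I and X), but only I matches M in the second round, so I is the intermediate that links the pair.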

Similar resources

Three-component Distillation Columns Sequencing: Including Configurations with Divided-wall Columns

In the present work, the exergy analysis and economic study of three different samples of three-component mixtures have been investigated (ESI>1, ESI≈1, and ESI<1). The feed mixture has been tested under three different compositions (low, equal, and high contents of the intermediate component). A quantitative comparison between simple and complex configurations, considering thermally coupled, thermo...

An Iterated Greedy Algorithm for Flexible Flow Lines with Sequence Dependent Setup Times to Minimize Total Weighted Completion Time

This paper explores flexible flow lines where setup times are sequence-dependent. The optimization criterion is the minimization of total weighted completion time. We propose an iterated greedy algorithm (IGA) to tackle the problem. An experimental evaluation is conducted to evaluate the proposed algorithm and, then, the obtained results of IGA are compared against those of some other exist...

The Comparison of the Effectiveness of a Modified Conformation Sensitive Gel Electrophoresis with Denaturing High Performance Liquid Chromatography

Background: Several methods have been developed for detection of sequence variation in genes and each has its advantages and disadvantages. A disadvantage of them is that the simpler, cost-effective methods are commonly perceived as being less sensitive in their detection of sequence variation, whereas those with proven sensitivity have a requirement for complex or expensive laboratory equipmen...

A fuzzy mixed-integer goal programming model for a parallel machine scheduling problem with sequence-dependent setup times and release dates

This paper presents a new mixed-integer goal programming (MIGP) model for a parallel machine scheduling problem with sequence-dependent setup times and release dates. Two objectives are considered in the model to minimize the total weighted flow time and the total weighted tardiness simultaneously. Due to the complexity of the above model and uncertainty involved in real-world scheduling probl...

A sequence of targets toward a common best practice frontier in DEA

Original data envelopment analysis models treat decision-making units as independent entities. This feature of data envelopment analysis results in significant diversity in input and output weights, which is irrelevant and problematic from the managerial point of view. In this regard, several methodologies have been developed to measure the efficiency scores based on common weights. Specificall...

Journal title:
  • Bioinformatics

Volume 14, Issue 8

Pages -

Publication date 1998